{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Feature Engineering Notebook\n",
    "\n",
    "Clean and extract features from raw data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "jp-MarkdownHeadingCollapsed": true,
    "pycharm": {
     "name": "#%% md\n"
    },
    "tags": []
   },
   "source": [
    "# Steps\n",
    "\n",
    "1. Split the data into training and test data set\n",
    "1. Clean the data (transform null values)\n",
    "1. Scale necessary attributes (normalization, standardization)\n",
    "1. Save transformed data for model training\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Import packages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# data manipulation\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "# data splitting\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# data preprocessing\n",
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "\n",
    "# serializing, compressing, and loading the models\n",
    "import joblib\n",
    "import sys\n",
    "sys.path.append(\"../lib\")\n",
    "\n",
    "from getConfig import *\n",
    "config = getConfig(\"../\")\n",
    "config.cleanup(config.traintest_path)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Load the data\n",
    "\n",
    "Load comma separated data from disk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "data = pd.read_csv(config.input_data, sep=\",\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Create Training and Test Dataset\n",
    "> uses scikit-learn\n",
    "\n",
    "Performing this early minimizes generalization and bias you may inadvertently apply to your system.\n",
    "Simply put, a test set of data involves: picking ~20% of the instances randomly and setting them aside.\n",
    "\n",
    "Some considerations for sampling methods that generate the test set:\n",
    "1. you don't want your model to see the entire dataset\n",
    "1. you want to be able to fetch new data for training\n",
    "1. you want to maintain the same percentage of training data against the entire dataset\n",
    "1. you want a representative training dataset (~7% septic positive)\n",
    "\n",
    "https://realpython.com/train-test-split-python-data/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# sets 10%/15%/20% of the data aside for testing, sets the random number generate to it always generates the same shuffled indicies\n",
    "# x = 2 dimensional array with inputs\n",
    "# X_train is the training part of the first sequence (x)\n",
    "# X_test is the test part of the first sequence (x)\n",
    "# y = 1 dimensional array with outputs\n",
    "# y_train is the labeled training part of the second sequence\n",
    "# y_test is the labeled test part of the second sequence\n",
    "# axis Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’)\n",
    "# test_size is the amount of the total dataset to set aside for testing = 10%\n",
    "# random state fixes the randomization so you get the same results each time\n",
    "# Shuffle before the data is split, it is shuffled\n",
    "# stratified splitting keeps the proportion of y values trhough the train and test sets\n",
    "X_train, X_test, y_train, y_test = \\\n",
    "    train_test_split(data.drop([\"Age\", \"Unit1\", \"Unit2\", \"HospAdmTime\", \"ICULOS\", \"Gender\", \"Bilirubin_direct\", \"TroponinI\", \"isSepsis\"], axis=1),\n",
    "    data[\"isSepsis\"], test_size=0.20,\n",
    "    random_state=42, stratify=data[\"isSepsis\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Clean the Data\n",
    "Instead of preparing data manually, write functions to:\n",
    "1. reproduce transformations easily on any dataset (e.g., data refresh)\n",
    "1. builds a library of functions to reuse in future projects\n",
    "1. use functions in live stream to transform new data before inferencing\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Steps\n",
    "1. transform current and future null values\n",
    "1. impute median for missing attributes (>7k)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 Transform missing values from numeric data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 80.  ,  99.  ,  36.89, ...,  15.  , 243.  , 341.  ],\n",
       "       [ 84.  ,  95.  ,  37.06, ...,  13.5 , 243.  , 292.  ],\n",
       "       [ 78.  , 100.  ,  36.89, ...,   7.5 , 243.  , 135.  ],\n",
       "       ...,\n",
       "       [ 76.  ,  98.  ,  36.89, ...,   6.2 , 243.  ,  77.  ],\n",
       "       [ 80.  , 100.  ,  36.4 , ...,  25.7 , 243.  , 196.  ],\n",
       "       [100.5 ,  99.  ,  36.89, ...,   0.6 ,  77.  , 172.  ]])"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# create simpleimputer instance\n",
    "# replace attributes missing values with median of the attribute\n",
    "imputer = SimpleImputer(strategy=\"median\")\n",
    "\n",
    "# fit applies the imputer to ALL numeric data in case new data includes null values\n",
    "# when system goes live\n",
    "# results are stored in a imputer.statistics_ value\n",
    "imputer.fit_transform(X_train)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>HR</th>\n",
       "      <th>O2Sat</th>\n",
       "      <th>Temp</th>\n",
       "      <th>SBP</th>\n",
       "      <th>MAP</th>\n",
       "      <th>DBP</th>\n",
       "      <th>Resp</th>\n",
       "      <th>EtCO2</th>\n",
       "      <th>BaseExcess</th>\n",
       "      <th>HCO3</th>\n",
       "      <th>...</th>\n",
       "      <th>Magnesium</th>\n",
       "      <th>Phosphate</th>\n",
       "      <th>Potassium</th>\n",
       "      <th>Bilirubin_total</th>\n",
       "      <th>Hct</th>\n",
       "      <th>Hgb</th>\n",
       "      <th>PTT</th>\n",
       "      <th>WBC</th>\n",
       "      <th>Fibrinogen</th>\n",
       "      <th>Platelets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>80.0</td>\n",
       "      <td>99.0</td>\n",
       "      <td>36.89</td>\n",
       "      <td>129.0</td>\n",
       "      <td>100.00</td>\n",
       "      <td>77.0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1.8</td>\n",
       "      <td>2.8</td>\n",
       "      <td>3.8</td>\n",
       "      <td>0.7</td>\n",
       "      <td>28.0</td>\n",
       "      <td>9.3</td>\n",
       "      <td>25.8</td>\n",
       "      <td>15.0</td>\n",
       "      <td>243.0</td>\n",
       "      <td>341.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1084</th>\n",
       "      <td>84.0</td>\n",
       "      <td>95.0</td>\n",
       "      <td>37.06</td>\n",
       "      <td>134.0</td>\n",
       "      <td>84.00</td>\n",
       "      <td>57.5</td>\n",
       "      <td>17.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>22.0</td>\n",
       "      <td>...</td>\n",
       "      <td>2.5</td>\n",
       "      <td>5.1</td>\n",
       "      <td>4.1</td>\n",
       "      <td>0.7</td>\n",
       "      <td>31.6</td>\n",
       "      <td>10.7</td>\n",
       "      <td>121.5</td>\n",
       "      <td>13.5</td>\n",
       "      <td>243.0</td>\n",
       "      <td>292.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>414</th>\n",
       "      <td>78.0</td>\n",
       "      <td>100.0</td>\n",
       "      <td>36.89</td>\n",
       "      <td>129.0</td>\n",
       "      <td>80.00</td>\n",
       "      <td>57.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1.8</td>\n",
       "      <td>2.7</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.7</td>\n",
       "      <td>23.6</td>\n",
       "      <td>8.0</td>\n",
       "      <td>30.3</td>\n",
       "      <td>7.5</td>\n",
       "      <td>243.0</td>\n",
       "      <td>135.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>437</th>\n",
       "      <td>79.5</td>\n",
       "      <td>99.0</td>\n",
       "      <td>37.00</td>\n",
       "      <td>114.5</td>\n",
       "      <td>87.50</td>\n",
       "      <td>71.5</td>\n",
       "      <td>31.5</td>\n",
       "      <td>2.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>25.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1.7</td>\n",
       "      <td>3.3</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.7</td>\n",
       "      <td>32.9</td>\n",
       "      <td>11.3</td>\n",
       "      <td>31.1</td>\n",
       "      <td>18.4</td>\n",
       "      <td>243.0</td>\n",
       "      <td>268.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>671</th>\n",
       "      <td>102.0</td>\n",
       "      <td>98.0</td>\n",
       "      <td>36.44</td>\n",
       "      <td>122.0</td>\n",
       "      <td>70.67</td>\n",
       "      <td>57.5</td>\n",
       "      <td>16.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>24.0</td>\n",
       "      <td>...</td>\n",
       "      <td>1.9</td>\n",
       "      <td>2.3</td>\n",
       "      <td>3.8</td>\n",
       "      <td>3.3</td>\n",
       "      <td>28.2</td>\n",
       "      <td>9.3</td>\n",
       "      <td>38.5</td>\n",
       "      <td>11.0</td>\n",
       "      <td>243.0</td>\n",
       "      <td>160.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 32 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         HR  O2Sat   Temp    SBP     MAP   DBP  Resp  EtCO2  BaseExcess  HCO3  \\\n",
       "23     80.0   99.0  36.89  129.0  100.00  77.0  18.0    2.0         0.0  22.0   \n",
       "1084   84.0   95.0  37.06  134.0   84.00  57.5  17.0    2.0         0.0  22.0   \n",
       "414    78.0  100.0  36.89  129.0   80.00  57.0  10.0    2.0         0.0  24.0   \n",
       "437    79.5   99.0  37.00  114.5   87.50  71.5  31.5    2.0         1.0  25.0   \n",
       "671   102.0   98.0  36.44  122.0   70.67  57.5  16.0    2.0         0.0  24.0   \n",
       "\n",
       "      ...  Magnesium  Phosphate  Potassium  Bilirubin_total   Hct   Hgb  \\\n",
       "23    ...        1.8        2.8        3.8              0.7  28.0   9.3   \n",
       "1084  ...        2.5        5.1        4.1              0.7  31.6  10.7   \n",
       "414   ...        1.8        2.7        4.0              0.7  23.6   8.0   \n",
       "437   ...        1.7        3.3        4.0              0.7  32.9  11.3   \n",
       "671   ...        1.9        2.3        3.8              3.3  28.2   9.3   \n",
       "\n",
       "        PTT   WBC  Fibrinogen  Platelets  \n",
       "23     25.8  15.0       243.0      341.0  \n",
       "1084  121.5  13.5       243.0      292.0  \n",
       "414    30.3   7.5       243.0      135.0  \n",
       "437    31.1  18.4       243.0      268.0  \n",
       "671    38.5  11.0       243.0      160.0  \n",
       "\n",
       "[5 rows x 32 columns]"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# apply the trained imputer to transform the training set replacing the\n",
    "# missing values with learn medians\n",
    "N = imputer.transform(X_train)\n",
    "# result above is plain NumPy array with transformed features\n",
    "# put back to a pandas DataFrame\n",
    "M = pd.DataFrame(N, columns=X_train.columns, index=X_train.index)\n",
    "M.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "pycharm": {
     "name": "#%% md\n"
    }
   },
   "source": [
    "# 3 Feature Scaling\n",
    "1. ML algorithms don't work well when numeric attributes have very different scales\n",
    "    (e.g. HR max 184,  pH max 7.67)\n",
    "1. Scaling target values is not necessary\n",
    "1. Apply\n",
    "    1. normalization (MinMaxScaler) bounds the values to a specific range (e.g. 0-1)\n",
    "    1. standardization (StandardScaler) less affected by outliers does not bound to range"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>HR</th>\n",
       "      <th>O2Sat</th>\n",
       "      <th>Temp</th>\n",
       "      <th>SBP</th>\n",
       "      <th>MAP</th>\n",
       "      <th>DBP</th>\n",
       "      <th>Resp</th>\n",
       "      <th>EtCO2</th>\n",
       "      <th>BaseExcess</th>\n",
       "      <th>HCO3</th>\n",
       "      <th>...</th>\n",
       "      <th>Magnesium</th>\n",
       "      <th>Phosphate</th>\n",
       "      <th>Potassium</th>\n",
       "      <th>Bilirubin_total</th>\n",
       "      <th>Hct</th>\n",
       "      <th>Hgb</th>\n",
       "      <th>PTT</th>\n",
       "      <th>WBC</th>\n",
       "      <th>Fibrinogen</th>\n",
       "      <th>Platelets</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>23</th>\n",
       "      <td>-0.232464</td>\n",
       "      <td>0.510396</td>\n",
       "      <td>0.024048</td>\n",
       "      <td>0.547440</td>\n",
       "      <td>1.540106</td>\n",
       "      <td>2.031346</td>\n",
       "      <td>0.023994</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.043109</td>\n",
       "      <td>-0.579576</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.540031</td>\n",
       "      <td>-0.776203</td>\n",
       "      <td>-0.610075</td>\n",
       "      <td>-0.142612</td>\n",
       "      <td>-0.715562</td>\n",
       "      <td>-0.858904</td>\n",
       "      <td>-0.499411</td>\n",
       "      <td>0.620465</td>\n",
       "      <td>-0.064134</td>\n",
       "      <td>1.276695</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1084</th>\n",
       "      <td>0.008067</td>\n",
       "      <td>-0.879637</td>\n",
       "      <td>0.287529</td>\n",
       "      <td>0.800337</td>\n",
       "      <td>0.437076</td>\n",
       "      <td>-0.105018</td>\n",
       "      <td>-0.166885</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.043109</td>\n",
       "      <td>-0.579576</td>\n",
       "      <td>...</td>\n",
       "      <td>1.448641</td>\n",
       "      <td>1.512730</td>\n",
       "      <td>-0.063418</td>\n",
       "      <td>-0.142612</td>\n",
       "      <td>0.009165</td>\n",
       "      <td>-0.036622</td>\n",
       "      <td>5.309965</td>\n",
       "      <td>0.349054</td>\n",
       "      <td>-0.064134</td>\n",
       "      <td>0.786204</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>414</th>\n",
       "      <td>-0.352730</td>\n",
       "      <td>0.857904</td>\n",
       "      <td>0.024048</td>\n",
       "      <td>0.547440</td>\n",
       "      <td>0.161319</td>\n",
       "      <td>-0.159797</td>\n",
       "      <td>-1.503036</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.043109</td>\n",
       "      <td>-0.077481</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.540031</td>\n",
       "      <td>-0.875722</td>\n",
       "      <td>-0.245637</td>\n",
       "      <td>-0.142612</td>\n",
       "      <td>-1.601341</td>\n",
       "      <td>-1.622452</td>\n",
       "      <td>-0.226243</td>\n",
       "      <td>-0.736588</td>\n",
       "      <td>-0.064134</td>\n",
       "      <td>-0.785371</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>437</th>\n",
       "      <td>-0.262531</td>\n",
       "      <td>0.510396</td>\n",
       "      <td>0.194536</td>\n",
       "      <td>-0.185962</td>\n",
       "      <td>0.678364</td>\n",
       "      <td>1.428782</td>\n",
       "      <td>2.600857</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.387765</td>\n",
       "      <td>0.173567</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.824127</td>\n",
       "      <td>-0.278609</td>\n",
       "      <td>-0.245637</td>\n",
       "      <td>-0.142612</td>\n",
       "      <td>0.270873</td>\n",
       "      <td>0.315785</td>\n",
       "      <td>-0.177680</td>\n",
       "      <td>1.235662</td>\n",
       "      <td>-0.064134</td>\n",
       "      <td>0.545963</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>671</th>\n",
       "      <td>1.090455</td>\n",
       "      <td>0.162888</td>\n",
       "      <td>-0.673403</td>\n",
       "      <td>0.193384</td>\n",
       "      <td>-0.481885</td>\n",
       "      <td>-0.105018</td>\n",
       "      <td>-0.357763</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.043109</td>\n",
       "      <td>-0.077481</td>\n",
       "      <td>...</td>\n",
       "      <td>-0.255935</td>\n",
       "      <td>-1.273797</td>\n",
       "      <td>-0.610075</td>\n",
       "      <td>1.593540</td>\n",
       "      <td>-0.675300</td>\n",
       "      <td>-0.858904</td>\n",
       "      <td>0.271530</td>\n",
       "      <td>-0.103297</td>\n",
       "      <td>-0.064134</td>\n",
       "      <td>-0.535120</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 32 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "            HR     O2Sat      Temp       SBP       MAP       DBP      Resp  \\\n",
       "23   -0.232464  0.510396  0.024048  0.547440  1.540106  2.031346  0.023994   \n",
       "1084  0.008067 -0.879637  0.287529  0.800337  0.437076 -0.105018 -0.166885   \n",
       "414  -0.352730  0.857904  0.024048  0.547440  0.161319 -0.159797 -1.503036   \n",
       "437  -0.262531  0.510396  0.194536 -0.185962  0.678364  1.428782  2.600857   \n",
       "671   1.090455  0.162888 -0.673403  0.193384 -0.481885 -0.105018 -0.357763   \n",
       "\n",
       "      EtCO2  BaseExcess      HCO3  ...  Magnesium  Phosphate  Potassium  \\\n",
       "23      0.0    0.043109 -0.579576  ...  -0.540031  -0.776203  -0.610075   \n",
       "1084    0.0    0.043109 -0.579576  ...   1.448641   1.512730  -0.063418   \n",
       "414     0.0    0.043109 -0.077481  ...  -0.540031  -0.875722  -0.245637   \n",
       "437     0.0    0.387765  0.173567  ...  -0.824127  -0.278609  -0.245637   \n",
       "671     0.0    0.043109 -0.077481  ...  -0.255935  -1.273797  -0.610075   \n",
       "\n",
       "      Bilirubin_total       Hct       Hgb       PTT       WBC  Fibrinogen  \\\n",
       "23          -0.142612 -0.715562 -0.858904 -0.499411  0.620465   -0.064134   \n",
       "1084        -0.142612  0.009165 -0.036622  5.309965  0.349054   -0.064134   \n",
       "414         -0.142612 -1.601341 -1.622452 -0.226243 -0.736588   -0.064134   \n",
       "437         -0.142612  0.270873  0.315785 -0.177680  1.235662   -0.064134   \n",
       "671          1.593540 -0.675300 -0.858904  0.271530 -0.103297   -0.064134   \n",
       "\n",
       "      Platelets  \n",
       "23     1.276695  \n",
       "1084   0.786204  \n",
       "414   -0.785371  \n",
       "437    0.545963  \n",
       "671   -0.535120  \n",
       "\n",
       "[5 rows x 32 columns]"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "scaler = StandardScaler()\n",
    "\n",
    "O = scaler.fit_transform(N)\n",
    "P = pd.DataFrame(O, columns=X_train.columns, index=X_train.index)\n",
    "P.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1 Transformation Pipeline\n",
    "\n",
    "Common to apply many transformation steps in a specific order (fill the nulls before you apply the scaling)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# this pipeline should work for all the estimators/algorithms\n",
    "pipeline = Pipeline([\n",
    "                    ('imputer', SimpleImputer(strategy='median')),\n",
    "                    ('std_scaler', StandardScaler()),\n",
    "                    ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# this is the transformed data to train from\n",
    "X_train_prepared = pipeline.fit_transform(X_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# neural networks sometimes expect a 0-1 normalized scale and perform better\n",
    "pipeline_minmax = Pipeline([\n",
    "                    ('imputer', SimpleImputer(strategy='median')),\n",
    "                    ('minMax', MinMaxScaler()),\n",
    "                    ])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# this is the transformed data to train the MLP from\n",
    "X_train_prepared_m = pipeline_minmax.fit_transform(X_train)\n",
    "X_test_prepared=pipeline_minmax.fit_transform(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Save the data for model training\n",
    "\n",
    "Common to apply many transformation steps in a specific order (fill the nulls before you apply the scaling)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false,
    "jupyter": {
     "outputs_hidden": false
    },
    "pycharm": {
     "name": "#%%\n"
    }
   },
   "outputs": [],
   "source": [
    "# compress and save the pipeline\n",
    "\n",
    "joblib.dump(pipeline, config.traintest_path + \"pipeline.pkl\")\n",
    "joblib.dump(pipeline_minmax, config.traintest_path + \"pipeline_minmax.pkl\")\n",
    "\n",
    "#Save the transformed data into data/transform folder\n",
    "\n",
    "np.savetxt(config.traintest_path + \"X_train_prepared_m.csv\", X_train_prepared_m, delimiter=\",\")\n",
    "np.savetxt(config.traintest_path + \"X_train_prepared.csv\", X_train_prepared, delimiter=\",\")\n",
    "np.savetxt(config.traintest_path + \"X_train.csv\", X_train, delimiter=\",\")\n",
    "np.savetxt(config.traintest_path + \"X_test.csv\", X_test, delimiter=\",\")\n",
    "np.savetxt(config.traintest_path + \"X_test_prepared.csv\", X_test_prepared, delimiter=\",\")\n",
    "np.savetxt(config.traintest_path + \"y_train.csv\", y_train, delimiter=\",\")\n",
    "np.savetxt(config.traintest_path + \"y_test.csv\", y_test, delimiter=\",\")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
